#### SmartCell Reconfigurable Architecture for Low-Power Stream Processing

Cao Liang and *Xinming Huang*, Embedded Computing Lab, Worcester Polytechnic Institute, http://computing.wpi.edu

MAPLD Conference, September 15-18, 2008, Annapolis, MD



## Outline

- Introduction and Motivation
- SmartCell Architecture
- SmartCell Prototype with 64 PEs
- Benchmark Applications and Performance
- Conclusions

# Challenges

- Major driving forces in embedded computing
  - Multimedia signal and image processing
  - Wireless communications
  - Military and space applications
- Design challenges:
  - Low power (power efficiency)
  - High performance
  - Flexibility (Programmability or reconfigurability)



Xinming Huang

# **Existing Computing Platforms**

- General purpose processors (GPP)
- Application specific integrated circuit (ASIC)
- Reconfigurable architecture
  - Dominated by Field Programmable Gate Array (FPGA)
- New architectures: CellBE, GPU



#### Performance, Power Efficiency


#### Coarse-Grained Reconfigurable Architecture (CGRA)

- Motivations of the SmartCell architecture
  - Coarse-grained computing operators
  - Reconfigurable interconnection
  - Domain specific, e.g. stream processing
- Bridging the gap between FPGA and ASIC




#### **Overview of SmartCell Architecture**

Computing units are tiled in a 2D structure




#### **Design of Processor Element**

- Processor Element (PE)
  - 16-bit input, 36-bit output
  - Logic, Shift, and Arithmetic operations



# Design of Cell Unit

- Each cell unit includes 4 PEs arranged as a quad structure
- Fully connected crossbar (S\_Box) for data exchange
- Serial peripheral interface (SPI) for instruction configuration
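
As a rough illustration of what "fully connected" means for the four PEs of a cell, the sketch below models a 4-port crossbar in which every output may select any input. The class name, the `select` vector, and the routing interface are hypothetical; the slides do not describe the S\_Box at this level of detail.

```python
class Crossbar:
    """Toy model of a fully connected crossbar (hypothetical interface).

    Output port i forwards the value of input port select[i], so any
    permutation or broadcast pattern among the ports can be configured.
    """

    def __init__(self, ports=4):
        self.ports = ports
        self.select = list(range(ports))  # default: output i <- input i

    def configure(self, select):
        """Install a new routing pattern, one input index per output port."""
        if len(select) != self.ports:
            raise ValueError("need one input index per output port")
        if any(not 0 <= s < self.ports for s in select):
            raise ValueError("input index out of range")
        self.select = list(select)

    def route(self, inputs):
        """Return the values seen at the output ports for the given inputs."""
        return [inputs[s] for s in self.select]


xbar = Crossbar()
xbar.configure([1, 2, 3, 0])           # rotate data around the four PEs
rotated = xbar.route([10, 20, 30, 40])
xbar.configure([0, 0, 0, 0])           # broadcast PE 0 to every PE
broadcast = xbar.route([10, 20, 30, 40])
```

Here `rotated` is `[20, 30, 40, 10]` and `broadcast` is `[10, 10, 10, 10]`.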




# **On-chip Interconnection Design**

Modified CMesh On-chip Network



# Control Logic Design

- Four types of control signals
  - Program counter control
  - Datapath/delay control
  - Operation control
  - Network-on-Chip control

Format of the instruction code (64 bits per instruction)

| Field     | PC control | Datapath control | I/O delay | Operation control | NoC control | RESV |
|-----------|------------|------------------|-----------|-------------------|-------------|------|
| # of bits | 9          | 20               | 7         | 10                | 11          | 7    |
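
The field widths in the table sum to 64 bits. A minimal sketch of packing and unpacking such a word follows. The widths are taken from the table; the MSB-first field order and the Python field names are assumptions made for illustration, since the slides do not give the actual bit positions.

```python
# Field widths come from the table; ordering (first field in the most
# significant bits) and the names below are illustrative assumptions.
FIELDS = [
    ("pc_control", 9),
    ("datapath_control", 20),
    ("io_delay", 7),
    ("operation_control", 10),
    ("noc_control", 11),
    ("reserved", 7),
]

TOTAL_BITS = sum(width for _, width in FIELDS)  # 64


def pack_instruction(values):
    """Pack a dict of field values into a single 64-bit integer."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_instruction(word):
    """Split a 64-bit integer back into its named fields."""
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values


example = {
    "pc_control": 5,
    "datapath_control": 0xABCDE,
    "io_delay": 3,
    "operation_control": 0x2A,
    "noc_control": 0x7FF,
    "reserved": 0,
}
word = pack_instruction(example)
roundtrip = unpack_instruction(word)
```

Packing followed by unpacking returns the original field values, which is a quick check that the widths cover the word without overlap.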

#### **Configuration Structure**

#### System configuration



# Prototype Chip Design

- Implementation of a seedling SmartCell system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs
  - RTL level design and simulation
  - FPGA prototyping
  - Standard cell ASIC implementation in TSMC 0.13 µm technology
  - Total area is about 8.2 mm<sup>2</sup>
  - Runs up to 107 MHz
  - Configuration time is within 12 µs

### SmartCell Features

- The combination of the following features distinguishes SmartCell within the CGRA family
  - Dynamic reconfiguration
  - Deep pipeline and parallelism
  - Hardware virtualization
  - Explicit synchronization
  - Unique system topology



#### **Application Domain and Benchmarks**

| Application domain              | Test benches                                                                                                   |
|---------------------------------|----------------------------------------------------------------------------------------------------------------|
| Signal processing               | 64-tap FIR<br>64-tap IIR                                                                                       |
| Multimedia and image processing | 32-point FFT<br>8×8 2D-DCT<br>8 by 8 Motion Estimation (ME) in a 24 by 24 search area                          |
| Scientific computing            | 128 by 128 Matrix Multiplication (MMM)<br>64<sup>th</sup>-order Polynomial Evaluation (PoE)<br>RC5 Data Encryption |

#### **Benchmark Mapping**

Infinite Impulse Response (IIR) filter

$$y(n) = \sum_{i=1}^{N} a_i y(n-i) + \sum_{i=0}^{M} b_i x(n-i)$$

 Biquad cascaded-IIR structure on a single Cell
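
The difference equation and the biquad cascade can be sketched in software as follows. This is a plain reference model, not the SmartCell mapping: the function names are hypothetical, the feedback terms are added with a plus sign exactly as in the equation above, and each biquad is simply a second-order instance of the same recurrence.

```python
def iir_direct(x, a, b):
    """Evaluate y(n) = sum_{i=1..N} a_i*y(n-i) + sum_{i=0..M} b_i*x(n-i).

    a holds a_1..a_N (feedback taps), b holds b_0..b_M (feed-forward taps).
    Samples before n = 0 are taken to be zero.
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i, b_i in enumerate(b):            # i = 0..M
            if n - i >= 0:
                acc += b_i * x[n - i]
        for i, a_i in enumerate(a, start=1):   # i = 1..N
            if n - i >= 0:
                acc += a_i * y[n - i]
        y.append(acc)
    return y


def biquad_cascade(x, sections):
    """Feed the signal through second-order sections, one after another.

    sections is a list of (a, b) pairs with len(a) == 2 and len(b) == 3.
    """
    for a, b in sections:
        x = iir_direct(x, a, b)
    return x


impulse = [1.0, 0.0, 0.0, 0.0]
one_pole = iir_direct(impulse, a=[0.5], b=[1.0])
section = ([0.5, 0.0], [1.0, 0.0, 0.0])
two_stage = biquad_cascade(impulse, [section, section])
```

For the unit impulse, `one_pole` is `[1.0, 0.5, 0.25, 0.125]` and the two-section cascade gives `[1.0, 1.0, 0.75, 0.5]`, the convolution of that response with itself.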



#### **Benchmark Mapping (cont'd)**

2D Discrete Cosine Transform (2D DCT)

$$X_{i,j} = a_i b_j \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} x_{k,l} \cos\left[\frac{\pi}{N}(k+1/2)i\right] \cos\left[\frac{\pi}{N}(l+1/2)j\right], \text{ where } 0 \le i, j < N$$

Decomposed into two 1D DCTs
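
The decomposition works because the 2D kernel is a product of two 1D cosine kernels, so a row pass followed by a column pass yields the same result as the double sum. The sketch below checks this numerically for an 8 by 8 block. The scale factors are not defined on the slide; orthonormal DCT-II scaling is assumed here, and all function names are hypothetical.

```python
import math


def scale(i, n):
    """Orthonormal DCT-II scale factor (an assumption, not from the slides)."""
    return math.sqrt(1.0 / n) if i == 0 else math.sqrt(2.0 / n)


def dct_1d(v):
    """1D DCT-II of a sequence v."""
    n = len(v)
    return [
        scale(i, n)
        * sum(v[k] * math.cos(math.pi / n * (k + 0.5) * i) for k in range(n))
        for i in range(n)
    ]


def dct_2d_separable(block):
    """Row-wise 1D DCT followed by column-wise 1D DCT."""
    rows = [dct_1d(row) for row in block]               # transform index l -> j
    cols = [dct_1d(list(col)) for col in zip(*rows)]    # transform index k -> i
    return [list(r) for r in zip(*cols)]                # back to result[i][j]


def dct_2d_direct(block):
    """Direct evaluation of the double sum, used only as a reference."""
    n = len(block)
    return [
        [
            scale(i, n) * scale(j, n) * sum(
                block[k][l]
                * math.cos(math.pi / n * (k + 0.5) * i)
                * math.cos(math.pi / n * (l + 0.5) * j)
                for k in range(n)
                for l in range(n)
            )
            for j in range(n)
        ]
        for i in range(n)
    ]


block = [[(3 * k + 5 * l) % 7 for l in range(8)] for k in range(8)]
fast = dct_2d_separable(block)
slow = dct_2d_direct(block)
max_error = max(abs(fast[i][j] - slow[i][j]) for i in range(8) for j in range(8))
```

The separable version needs 2N one-dimensional transforms of length N instead of an N² term sum per output, which is what makes the row/column mapping attractive.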




#### **Experimental Setup**

- Evaluation Metrics
  - Area & Timing
  - Power consumption
  - Throughput and power efficiency
  - Comparison with RaPiD, Altera's Stratix II FPGA, and ASIC

| Parameter        | Value                |
|------------------|----------------------|
| System dimension | 4 by 4               |
| Design tools     | ModelSim, Synopsys   |
| Library          | TSMC 0.13 µm process |
| Voltage          | 1 V                  |
| Simulation freq. | 100 MHz              |

#### Area and Power Consumption



# Power Consumption and Efficiency

- On average 156 mW power consumption @ 100 MHz
- 31 GOPS/W energy efficiency
  - only arithmetic & logic operations, excluding I/O power

| Metric                    | FIR  | IIR  | 2D-DCT | RC5  | MMM  | FFT  | PoE  | ME   |
|---------------------------|------|------|--------|------|------|------|------|------|
| P<sub>Dyn</sub> (mW)      | 144  | 181  | 153    | 132  | 135  | 161  | 142  | 137  |
| P<sub>Core</sub> (mW)     | 152  | 189  | 161    | 140  | 143  | 169  | 150  | 145  |
| E<sub>Eff</sub> (GOPS/W)  | 42.1 | 33.9 | 39.8   | 45.7 | 11.2 | 18.9 | 42.7 | 11.0 |
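
The two headline averages can be reproduced from the table with a few lines of arithmetic. The sketch also derives the throughput implied by each efficiency figure (GOPS = GOPS/W × W). Reading those implied values as a count of busy PEs, for example 6.4 GOPS as 64 PEs each completing one operation per cycle at 100 MHz, is an inference and is not stated in the slides.

```python
BENCHMARKS = ["FIR", "IIR", "2D-DCT", "RC5", "MMM", "FFT", "PoE", "ME"]
P_DYN_MW = [144, 181, 153, 132, 135, 161, 142, 137]
P_CORE_MW = [152, 189, 161, 140, 143, 169, 150, 145]
E_EFF_GOPS_PER_W = [42.1, 33.9, 39.8, 45.7, 11.2, 18.9, 42.7, 11.0]

# Plain arithmetic means over the eight benchmarks.
avg_core_mw = sum(P_CORE_MW) / len(P_CORE_MW)
avg_eff = sum(E_EFF_GOPS_PER_W) / len(E_EFF_GOPS_PER_W)

# Core power minus dynamic power, per benchmark.
non_dynamic_mw = [core - dyn for core, dyn in zip(P_CORE_MW, P_DYN_MW)]

# Throughput implied by efficiency and core power: GOPS = (GOPS/W) * W.
implied_gops = {
    name: eff * core / 1000.0
    for name, eff, core in zip(BENCHMARKS, E_EFF_GOPS_PER_W, P_CORE_MW)
}

print(f"average core power: {avg_core_mw:.1f} mW")
print(f"average efficiency: {avg_eff:.1f} GOPS/W")
for name in BENCHMARKS:
    print(f"{name:7s} implied throughput: {implied_gops[name]:.2f} GOPS")
```

The means come out at about 156.1 mW and 30.7 GOPS/W, matching the rounded 156 mW and 31 GOPS/W quoted above. The gap between core and dynamic power is 8 mW for every benchmark.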

#### Comparison with RaPiD, FPGA, and ASIC

#### Power and system throughput comparison

 Power consumption of RaPiD has been scaled to the same process technology as the SmartCell system

| System     | Metric      | FIR      | IIR      | MMM            | 2D DCT         | ME            | FFT      | PoE      |
|------------|-------------|----------|----------|----------------|----------------|---------------|----------|----------|
| SmartCell  | Power (mW)  | 152      | 189      | 143            | 161            | 145           | 169      | 150      |
|            | Throughput* | 100 MS/s | 100 MS/s | 763 Matrices/s | 1.56 Mblocks/s | 865 Kblocks/s | 58 MS/s  | 100 MS/s |
| RaPiD [26] | Power (mW)  | 203      | -        | 428            | 439            | 235           | -        | -        |
|            | Throughput  | 100 MS/s | -        | 763 Matrices/s | 1.56 Mblocks/s | 865 Kblocks/s | -        | -        |
| FPGA       | Power (mW)  | 725      | 896      | 445            | 787            | 573           | 431      | 628      |
|            | Throughput  | 100 MS/s | 100 MS/s | 763 Matrices/s | 1.56 Mblocks/s | 865 Kblocks/s | 100 MS/s | 100 MS/s |
| ASIC       | Power (mW)  | 31       | 45       | 9              | 55             | 12            | 33       | 55       |
|            | Throughput  | 100 MS/s | 100 MS/s | 763 Matrices/s | 1.56 Mblocks/s | 865 Kblocks/s | 58 MS/s  | 100 MS/s |

#### Comparison with RaPiD and FPGA

- 52% average power reduction compared with RaPiD
- 75% average power reduction compared with FPGA
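
The slides do not say how these averages were formed. As a rough cross-check, the sketch below computes the power reduction from the comparison table in two common ways: the mean of per-benchmark reductions and the reduction in summed power. Both variants are assumptions about the averaging method, and only benchmarks with a reported value for the baseline are included.

```python
SMARTCELL_MW = {"FIR": 152, "IIR": 189, "MMM": 143, "2D DCT": 161,
                "ME": 145, "FFT": 169, "PoE": 150}
RAPID_MW = {"FIR": 203, "MMM": 428, "2D DCT": 439, "ME": 235}
FPGA_MW = {"FIR": 725, "IIR": 896, "MMM": 445, "2D DCT": 787,
           "ME": 573, "FFT": 431, "PoE": 628}


def mean_reduction(baseline):
    """Mean of the per-benchmark reductions 1 - P_SmartCell / P_baseline."""
    ratios = [1.0 - SMARTCELL_MW[name] / power for name, power in baseline.items()]
    return sum(ratios) / len(ratios)


def aggregate_reduction(baseline):
    """Reduction in total power summed over the shared benchmarks."""
    smartcell_total = sum(SMARTCELL_MW[name] for name in baseline)
    return 1.0 - smartcell_total / sum(baseline.values())


for label, baseline in (("RaPiD", RAPID_MW), ("FPGA", FPGA_MW)):
    print(f"{label}: per-benchmark mean {mean_reduction(baseline):.1%}, "
          f"aggregate {aggregate_reduction(baseline):.1%}")
```

With the table values this gives roughly 48% and 54% against RaPiD, and roughly 74% and 75% against the FPGA, so the two methods bracket the quoted 52% and 75% rather than reproduce them exactly. Note that the FFT rows compare 58 MS/s on SmartCell with 100 MS/s on the FPGA, so the raw power ratio is not throughput-normalized for that benchmark.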



#### **Power Efficiency Comparison**



\* Compared with 90 nm Stratix II FPGAs


### Conclusions

- SmartCell, a new coarse-grained reconfigurable architecture, is proposed and developed
- The architecture is reconfigurable and can be targeted at different computing systems
- A prototype design with 64 PEs demonstrates both high throughput and power efficiency on data-streaming benchmarks
- SmartCell has the potential to bridge the gap between high-power FPGAs and inflexible ASICs

#### The following are backup slides


# **Existing CGRAs**




#### **Characteristic Comparison**

| System    | Archit.                  | Reconfig. | Application               | Connection           | Homogeneity   | Parallel<br>& Pipeline |
|-----------|--------------------------|-----------|---------------------------|----------------------|---------------|------------------------|
| RAW       | MIMD                     | Static    | Irregular or<br>general   | 2D Mesh              | Homogeneous   | Spatial &<br>Temporal  |
| RaPiD     | SIMD                     | Mixed     | Systolic or<br>pipeline   | 1D array             | Heterogeneous | Temporal               |
| PipeRench |                          | Dynamic   | Stream-based              | 1D array             | Homogeneous   | Spatial &<br>Temporal  |
| MATRIX    | VLIW<br>SIMD<br>MIMD     | Dynamic   | Systolic                  | Layered<br>Structure | Homogeneous   | -                      |
| CHESS     |                          | Static    | Multimedia                | Hexagon<br>mesh      | Homogeneous   | -                      |
| MorphoSys | SIMD                     | Dynamic   | Data-parallel             | 2D mesh              | Homogeneous   | Spatial                |
| SmartCell | SIMD<br>MIMD<br>Systolic | Mixed     | Multimedia or<br>Systolic | Layered<br>Structure | Homogeneous   | Spatial &<br>Temporal  |

Characteristics of the compared CGRAs

#### SmartCell: Tiled Architecture, Processor Design and Interconnect



- Many cells are tiled in a 2D layout
- Each cell has 4 PEs (N, W, S, E)
- Simplified processor with memory
- A crossbar within the cell; on-chip interconnect uses CMesh



#### **Features and Application Domain**




#### System Design and Performance

- Prototype chip design: 4x4 cells (64 PEs), TSMC 0.13 µm, 8.2 mm<sup>2</sup>, 1 V, about 156 mW @ 100 MHz
- Benchmarked against RaPiD, Stratix II (90 nm), and ASIC



#### Acknowledgement:

- Dr. Michael Fritz, DARPA/MTO YFA Program
- Cao Liang, WPI graduate assistant, now with AMD